Abstract

Applications such as animation and scientific visualization demand high performance rendering
of complex three dimensional scenes. To deliver the necessary rendering rates, highly
parallel hardware architectures are required. The challenge is then to design algorithms and
software which effectively use the hardware parallelism. This paper describes a rendering
algorithm targeted to distributed memory MIMD architectures. For maximum performance, the
algorithm exploits both object-level and pixel-level parallelism. The behavior of the algorithm is
examined both analytically and experimentally. Its performance for large numbers of processors
is found to be limited primarily by communication overheads. An experimental implementation
for the Intel iPSC/860 shows increasing performance from 1 to 128 processors across a wide
range of scene complexities. It is shown that minimal modifications to the algorithm will adapt
it for use on shared memory architectures as well.




This work was supported by the National Aeronautics and Space Administration under NASA Contract No.
NAS1-18605 while the authors were in residence at ICASE.

Authors' addresses: Thomas W. Crockett, ICASE, M.S. 132C, NASA Langley Research Center, Hampton, VA 23665;
Tobias Orloff, Great Northwestern Graphics, 119 North Fourth Street, Suite 206, Minneapolis, MN 55401.

Electronic mail: tom@icase.edu, orloff@poincare.geom.umn.edu.




1  Introduction 


Applications  such  as  real-time  animation  and  scientific  visualization  demand  high  performance 
rendering  of  complex  three-dimensional  scenes.  While  the  results  achieved  on  current  hardware 
have  been  impressive,  major  improvements  in  performance  will  require  the  use  of  highly  parallel 
hardware  and  scalable  parallel  rendering  algorithms.  This  paper  describes  one  such  rendering 
algorithm  for  MIMD  architectures.  Although  the  algorithm  is  designed  for  distributed  memory 
message  passing  systems,  straightforward  modifications  will  adapt  it  for  use  in  shared  memory 
environments. 

In the following section, we introduce the traditional rendering pipeline and consider the issues
involved in parallelizing it. Next, we present our algorithm and give a theoretical analysis of its
performance. We then describe an implementation on the Intel iPSC/860¹ hypercube, and compare
the experimental results with analytical predictions. Finally, we examine how the algorithm can
be adapted for shared memory MIMD architectures.

2  The  Rendering  Problem

We assume that we are given a scene consisting of objects described as collections of 3D triangles,
some light sources, and a viewpoint. The goal is to produce a 2D representation of the scene taking
into account the lighting and perspective distortion (Fig. 1). For simplicity we assume the lights
are all point light sources and the triangles possess only a diffuse coloring attribute. The addition
of other material properties, such as specularity and texture, does not really affect the main structure
of the algorithm.

There is now a fairly well established pipeline for the fast rendering of such three dimensional
scenes [10]. The standard pipeline may be represented as shown in Figure 2. The exact sequence
is not fixed. For example, shading may be done after transforming (indeed, an essential portion of
Phong shading must be done in the rasterizing step [3]), or clipping may be delayed until after the
rasterization step.

One way to parallelize the rendering process is to map the various stages of the pipeline directly
into hardware [2]. This approach has been very successful, and has been adopted by a number
of graphics hardware vendors. But the ultimate performance attainable by directly exploiting the
pipeline is limited by the number of stages in the pipe. To achieve a greater degree of parallelism,
other strategies must be examined.

As is well known [11], there are three main steps in the rendering process which account for
most of the computation time. These are

1.  The floating point operations performed on objects, such as transforming, lighting, and
clipping.

2.  The  rasterization  of  primitives  transformed  into  screen  coordinates. 

3.  Writing pixels to the frame buffer.

(We ignore here the problem of traversing the database prior to rendering.) The rendering time will
be limited by the slowest of these three steps. Moreover, in current serial and pipelined hardware
implementations, each of these three steps is operating at its limit [11]. Thus to obtain significant
improvements in performance, it is necessary to map the rendering pipeline onto a hardware
architecture in which each of these three steps can be parallelized, preferably by replicating one basic

¹ iPSC, iPSC/2, iPSC/860, and i860 are trademarks of Intel Corporation.
type of processing element. We refer to parallel computations in step 1 as object parallelism, and in
steps 2 and 3 as image or pixel parallelism. A system with a high degree of object parallelism is
described by Torborg in [12]. A system with a high degree of pixel parallelism, the classic Pixel-Planes
system of Fuchs and Poulton, is described in [6]. Finally, a system incorporating both object and
pixel parallelism is described by Fuchs et al. in [7]. In all these cases, the algorithms for 3D
rendering are mapped onto specific hardware, more or less constructed for that purpose. In our case, we
map the rendering algorithm onto more general purpose parallel architectures. This allows us to
experiment with the algorithm at a high level and with a high degree of flexibility. Once the critical
performance parameters and tradeoffs are thoroughly understood, then special-purpose hardware
can be designed to achieve maximum performance. As we will show, the algorithm described in
this paper achieves both object and pixel parallelism, and will run on systems containing from 1
to p processors, where p is bounded by the number of scanlines. For an excellent discussion of the
various approaches to object and pixel parallelization, see [11].

Besides exploiting both types of parallelism, a good algorithm must ensure that all large data
structures are distributed among the processors without wasteful duplication. In our case there are
two  such  structures:  the  list  of  triangles  and  the  frame  buffer.  We  distribute  these  structures  evenly 
among  the  processors,  allowing  the  algorithm  to  scale  to  more  complex  scenes  and  higher  resolutions. 
Note  that  distributing  the  triangles  corresponds  to  object  parallelism,  while  distributing  the  frame 
buffer  corresponds  to  pixel  parallelism. 

3  Algorithm  Description 

To describe the algorithm we first specify how the data structures are divided among the processors:

•  The triangles are distributed evenly in round-robin fashion to all processors.

•  The  frame  buffer  is  divided  among  the  processors  by  horizontal  stripes  (Fig.  3). 

•  Small  data  structures,  such  as  the  lights  and  viewing  parameters,  are  replicated  on  each 
processor. 

The distribution of the frame buffer can be modified considerably without affecting the basic
structure of the algorithm. Essentially all that is needed is a regular geometric division. We have
implemented only a division into horizontal stripes, which seems appropriate for rendering into
a frame buffer of size 1024 x 1024 using from 2 to 128 processors. The effect on performance of
different splittings of the frame buffer is an interesting topic for further research.
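
As an illustration only, the data distribution described above can be sketched in a few lines of Python. This is not the authors' code (their implementation is in C); the function names assign_triangles and stripe_owner are hypothetical, and the sketch assumes that the number of scanlines y is a multiple of the number of processors p.

```python
def assign_triangles(num_triangles, p):
    """Round-robin distribution: triangle i is stored on processor i mod p."""
    owned = [[] for _ in range(p)]
    for i in range(num_triangles):
        owned[i % p].append(i)
    return owned


def stripe_owner(scanline, y, p):
    """Processor that owns a scanline when y scanlines are cut into p equal
    horizontal stripes (assumes y is a multiple of p)."""
    if y % p != 0:
        raise ValueError("this sketch assumes y is a multiple of p")
    return scanline // (y // p)


if __name__ == "__main__":
    # 10 triangles over 4 processors; a 1024-line frame buffer over 128 processors
    print(assign_triangles(10, 4))
    print(stripe_owner(8, 1024, 128))
```

With 1024 scanlines and 128 processors each stripe is 8 scanlines high, so scanline 8 is the first line owned by processor 1.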

With  the  above  distribution  of  data,  the  following  strategy  is  used: 

•  The shading, transforming, and clipping steps are performed by each processor on its local
triangles.

•  Before rasterizing a triangle, it is first transformed into screen coordinates, then split (if
necessary) into trapezoids along local frame buffer boundaries (Fig. 4). Each trapezoid is
then sent to the processor which owns the segment of the frame buffer in which it lies.

•  Upon receiving a trapezoid, a given processor rasterizes it into its local frame buffer using a
standard z-buffer algorithm [1] to eliminate hidden surfaces.

For simplicity, triangles which lie fully within a single frame buffer segment and triangular pieces
of split triangles are treated as degenerate trapezoids in which two of the vertices happen to be
Figure 1: A scene is described by a collection of objects and light sources, with associated viewing
parameters.


Figure 2: A typical rendering pipeline. The dotted line divides object-level and pixel-level
operations.


Figure 3: The frame buffer is distributed across processors by horizontal stripes.


Figure 4: In general, triangles must be split at frame buffer boundaries and the pieces sent to the
correct processors for rasterization.


the  same  point.  To  reduce  communication  overhead,  trapezoids  destined  for  the  same  processor 
are  buffered  into  larger  messages  before  sending.  The  choice  of  buffer  size,  which  can  significantly 
affect  performance,  is  discussed  more  fully  later. 
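
The splitting and buffering steps can be illustrated with a simplified Python sketch. It is a hedged approximation, not the implementation described in this paper: it cuts only the vertical extent of a primitive at stripe boundaries and ignores the edge geometry that a real trapezoid carries. The names split_span and OutgoingBuffers are hypothetical, and "sending" is modeled by appending to a list.

```python
def split_span(y_top, y_bottom, y, p):
    """Cut the scanline range [y_top, y_bottom] at stripe boundaries.
    Returns a list of (owner, first_scanline, last_scanline) pieces."""
    stripe = y // p                      # stripe height, assuming p divides y
    pieces = []
    start = y_top
    while start <= y_bottom:
        owner = start // stripe
        end = min(y_bottom, (owner + 1) * stripe - 1)
        pieces.append((owner, start, end))
        start = end + 1
    return pieces


class OutgoingBuffers:
    """One buffer per destination; a buffer is sent when it holds `depth` pieces."""

    def __init__(self, p, depth):
        self.depth = depth
        self.buffers = [[] for _ in range(p)]
        self.sent = []                   # stand-in for the network: (dest, pieces)

    def put(self, piece):
        dest = piece[0]
        self.buffers[dest].append(piece)
        if len(self.buffers[dest]) == self.depth:
            self.flush(dest)

    def flush(self, dest):
        if self.buffers[dest]:
            self.sent.append((dest, self.buffers[dest]))
            self.buffers[dest] = []

    def flush_all(self):
        """Called after the last local triangle: send all non-empty buffers."""
        for dest in range(len(self.buffers)):
            self.flush(dest)


if __name__ == "__main__":
    out = OutgoingBuffers(p=4, depth=2)
    for piece in split_span(5, 20, y=32, p=4):
        out.put(piece)
    out.flush_all()
    print(out.sent)
```

A larger depth means fewer, longer messages, which is the tradeoff behind the buffer size discussion.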

The algorithm may be summarized as follows. Each processor performs the loop:

Until done {
    If local triangles remain {
        Select a local triangle
        Shade the triangle
        Transform, back face cull, and clip
        Split into trapezoids
        Put the trapezoids into outgoing buffers
        When a buffer fills up, send its contents
        If this is the last local triangle {
            Send all non-empty buffers
        }
    }
    If incoming messages exist {
        For each incoming message {
            Rasterize all of the trapezoids in the message
        }
    }
}

To avoid having to store large numbers of trapezoids in memory, the algorithm alternates between
splitting triangles into trapezoids and disposing of incoming trapezoids by rasterizing them into
the frame buffer. It is not obvious what the proper balance is between these two activities. If a
processor concentrates on rasterizing incoming trapezoids, it may starve other processors by not
generating enough work to keep them busy. Alternatively, if incoming messages are not flushed
quickly enough, message queues will fill up and outgoing buffers will be delayed. Experiments have
shown that there is a slight advantage in processing at least a few triangles before checking for
incoming data. Beyond that, the algorithm is relatively insensitive to this choice.

A  significant  feature  of  this  algorithm  is  the  absence  of  a  synchronization  point  in  the  loop. 
Processors  will  start  off  with  nearly  the  same  number  of  triangles,  but  several  factors  will  tend 
to  unbalance  the  workload.  First,  the  culling  and  clipping  step  requires  a  different  number  of 
operations  for  different  triangles,  and  may  cause  triangles  to  be  thrown  away,  or  to  be  subdivided 
into  several  smaller  triangles.  Next,  the  time  required  for  splitting  into  trapezoids  varies  with  the 
orientation of the triangle and the number of frame buffer boundaries which are intersected. The
number  of  trapezoids  in  turn  affects  the  buffering  and  communication  times.  Similarly,  varying 
numbers  of  incoming  trapezoids,  along  with  differences  in  their  size  and  the  results  of  z-buffer 
comparisons,  will  cause  variations  in  the  rasterization  time. 

These considerations suggest that any synchronization points in the loop will introduce
significant amounts of idle time, since each iteration of the loop would be bound by the slowest processor.
Instead, our strategy is to let individual processors proceed as asynchronously as possible. Of
course, some coordination is necessary to ensure that message buffers are correctly passed from
one processor to the next. But the use of an asynchronous message-passing protocol, combined
with dual send and receive buffers, has proven effective in minimizing idle time spent waiting for
messages.

However, the lack of a synchronization point leads to difficulties in deciding when to exit the
loop. Even after a given processor completes work on its local triangles, it has no way of determining
by  itself  when  it  has  received  the  last  incoming  message  from  another  processor.  We  use  the 
following  algorithm  to  detect  termination: 

1.  Each  processor  maintains  a  list  of  all  other  processors  to  which  it  sends  trapezoids.  We  refer 
to  these  as  neighbors  of  the  sending  processor.  Note  that  a  processor  may  be  a  neighbor  of 
itself. 

2.  After the last local triangle is processed, a processor sends a Last Trapezoid (LT) message
to each of its neighbors, indicating that there will be no more work forthcoming from that
particular source. The message passing protocols must preserve message order so that LT
messages do not precede the trapezoids to which they refer.

3.  When  a  processor  receives  an  LT  message,  it  replies  with  a  Last  Trapezoid  Complete  (LTC) 
message. Receipt of an LTC message from a neighbor indicates that the neighbor has finished
rasterizing  all  of  the  trapezoids  sent  to  it.  The  processor  records  this  fact. 

4.  When LTC messages are received from every neighbor, a processor knows that all of its
neighbors  have  finished  all  of  the  work  that  it  gave  to  them.  The  processor  then  produces  a 
Neighbors  Complete  (NC)  message  which  it  sends  to  a  specific  processor,  which  we  arbitrarily 
choose  to  be  processor  0. 

5.  When  processor  0  receives  an  NC  message  from  every  processor  (including  itself),  it  knows 
that  each  processor  has  finished  all  of  the  work  sent  to  it  by  all  of  its  neighbors,  and  that  the 
local frame buffer segments now contain the final image. Processor 0 then broadcasts a Global
Completion  (GC)  message  to  all  processors.  Receipt  of  a  GC  message  notifies  a  processor 
that  rendering  is  complete  and  that  it  should  drop  out  of  the  loop. 

From the time it generates LT messages for its neighbors until the time it receives a GC message, a
processor must continue to check for incoming trapezoids and process them. Note that for a given
number of processors p, the NC messages can be accumulated in time O(log p) using a parallel
merge algorithm, rather than using the O(p) method described above. Similarly, the GC broadcast
can be done in time O(log p), or even O(1) if the architecture directly supports broadcasts.
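
The five steps of the termination protocol can be traced with a small sequential simulation. The Python sketch below is illustrative only: it models each processor's incoming message queue as a FIFO, omits the trapezoid messages themselves, and uses the simple O(p) collection of NC messages at processor 0. The function name simulate_termination is hypothetical, and the handling of a processor with no neighbors at all (it reports NC immediately) is an assumption that the text does not address.

```python
from collections import deque


def simulate_termination(neighbors):
    """neighbors[i] is the set of processors to which processor i sent trapezoids
    (a processor may be its own neighbor). Returns the number of control
    messages exchanged before every processor has received GC."""
    p = len(neighbors)
    inbox = [deque() for _ in range(p)]          # FIFO queues preserve message order
    pending_ltc = [set(nb) for nb in neighbors]  # LTC replies still awaited
    nc_seen = set()                              # NC senders recorded by processor 0
    done = [False] * p
    messages = 0

    def send(dest, kind, src):
        nonlocal messages
        inbox[dest].append((kind, src))
        messages += 1

    # Step 2: after its last local triangle, each processor sends LT to its neighbors.
    for i in range(p):
        for nb in neighbors[i]:
            send(nb, "LT", i)
        if not neighbors[i]:
            send(0, "NC", i)                     # assumption: nothing outstanding

    while not all(done):
        for i in range(p):
            while inbox[i]:
                kind, src = inbox[i].popleft()
                if kind == "LT":                 # step 3: earlier work from src is finished
                    send(src, "LTC", i)
                elif kind == "LTC":              # step 4: one neighbor has completed
                    pending_ltc[i].discard(src)
                    if not pending_ltc[i]:
                        send(0, "NC", i)
                elif kind == "NC":               # step 5: only processor 0 receives NC
                    nc_seen.add(src)
                    if len(nc_seen) == p:
                        for j in range(p):
                            send(j, "GC", 0)
                elif kind == "GC":
                    done[i] = True
    return messages


if __name__ == "__main__":
    print(simulate_termination([{0, 1}, {1, 2}, {2, 0}]))
```

In this model the message count is twice the total number of neighbor links (one LT and one LTC per link) plus p NC messages and p GC messages.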

4  Performance  Analysis 

To  analyze  the  performance  of  the  algorithm  we  will  break  it  down  into  the  following  steps: 

•  Shading,  transforming,  culling,  and  clipping. 

•  Splitting  into  trapezoids. 

•  Sending  trapezoids. 

•  Rasterizing trapezoids.

•  Storing pixel data.

•  Wait time.

•  Termination algorithm.


For  each  of  these  steps,  we  will  break  down  the  running  time  into  a  general  linear  part  and  an 
explicit  nonlinear  part.  The  linear  component  is  that  part  which  parallelizes  perfectly,  and  thus 
will  speed  up  linearly  as  the  number  of  processors  increases.  The  nonlinear  component  contains 
overheads  which  do  not  decrease  linearly  with  increasing  numbers  of  processors,  and  which  therefore 
detract  from  perfect  speedups.  Before  proceeding  we  introduce  the  following  notation: 

p = number of processors
n = number of triangles
y = height of the frame buffer (in scanlines)
h = average height of triangles (in pixels)
d = trapezoid buffer depth (in trapezoids)

$\tau$ = number of trapezoids generated per processor
$\nu$ = average number of trapezoids generated per neighbor


We  assume  that  y  and  h  are  fixed,  y  is  a  multiple  of  p,  n  >  p,  and  that  the  triangles  comprising 
the  scene  are  uniformly  distributed  with  respect  to  our  splitting  of  the  frame  buffer.  Note  that  h 
is  the  average  triangle  height  on  the  projection  plane,  rather  than  in  world  coordinates.  The  linear 
part of the running time is a term of the form

$$C\,\frac{n}{p} \qquad (1)$$

where the constant C is machine- and scene-dependent, but independent of n, p, and d. This is
the contribution to the running time that parallelizes perfectly. The nonlinear part of the running
time will be everything else. We will attempt to determine this as explicitly as possible in terms of
the above variables and machine dependent constants.


4.1  Shading, transforming, culling, and clipping

Since the triangles have been distributed evenly to the processors, and these operations may be
performed independently on each triangle, this part of the algorithm contributes only a linear term
to the running time.²


4.2  Splitting into trapezoids

Each triangle must first be split at its middle vertex (see Figure 4). Since this can be performed
independently for each triangle, it contributes only a linear term to the running time. As a side
effect, this split effectively doubles the number of triangles to 2n while reducing their average height
to h/2. Next, there is a certain setup cost before actually dividing the triangle into trapezoids.
Although this cost would not be incurred in a serial version of this algorithm, in the parallel version
it still contributes only a linear term to the total running time. This cost may be regarded as part
of the parallel overhead of the algorithm. Although we are not explicitly isolating the parallel
overhead in our analysis, it does in fact contribute very little to the running time, so that the
performance of the parallel algorithm running on one processor is virtually identical to that of a
serial version.

² Strictly speaking, back face culling and clipping can introduce local variations in workload which will detract
from perfect speedup. But since we are assuming a uniform scene for purposes of analysis, we can ignore this effect.
Similar variations can be introduced in the rasterization and z-buffer computations; these will likewise be ignored. In
practice, the impact of these variations is scene-dependent.


A nonlinear contribution to the running time results from actually splitting the triangle. In
loose terms, the more processors we have the more we must split the triangle, so that adding
processors increases the number of trapezoids in the system. To quantify this, one easily computes
that a triangle in the projection plane crosses a local frame buffer boundary line on average hp/2y
times. Since back face culling will, on average, eliminate half the original triangles, the number of
resulting trapezoids per processor is


$$\tau = \frac{nh}{2y} + \frac{n}{p} \qquad (2)$$
and the time to split n triangles among p processors is simply $\tau\,t_{split}$, where $t_{split}$ is the time for
one split. We can further analyze $t_{split}$ by counting the arithmetic operations performed. The
actual time will of course depend on the precise assembly code generated and the characteristics
of the processor. In our current implementation, one split requires 15 integer adds and 10 integer
compares.
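
To make the behavior of this count concrete, the following Python sketch evaluates the number of trapezoids per processor, reading the expression above as $\tau = nh/2y + n/p$. The function name is hypothetical. The sample values (100000 triangles of height 10 pixels, a 1024-line frame buffer) are taken from the test scene and frame buffer size mentioned elsewhere in this paper.

```python
def trapezoids_per_processor(n, p, y, h):
    """tau = n*h/(2*y) + n/p.
    The first term comes from cuts at stripe boundaries and does not shrink
    as processors are added; the second is the perfectly divided share."""
    return n * h / (2 * y) + n / p


if __name__ == "__main__":
    for p in (1, 16, 128):
        print(p, trapezoids_per_processor(n=100000, p=p, y=1024, h=10))
```

Because the boundary term nh/2y is independent of p, the total number of trapezoids in the system, p times this value, grows linearly with the number of processors.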


4.3  Sending  trapezoids 

To a first approximation we assume that


•  A single processor sending several messages must do so one at a time.

•  Multiple processors can be sending simultaneously.


•  A processor does not incur communication overheads for messages to itself.


The second assumption in particular is somewhat questionable: edge contention among competing
sends can seriously impair message passing performance, as shown in [1]. This point will be
discussed more fully in later sections.

Communication time can be divided into two independent parts, a fixed overhead, or latency, $t_l$,
and a transfer cost $t_t$. The latency includes various software overheads and hardware delays, effects
from network contention, etc. This is incurred on a per message basis. The transfer cost is just the
inverse of the network bandwidth multiplied by the total number of bytes to be communicated.

A processor will, on average, generate $\nu = \tau/p$ trapezoids for each of p destinations, including
itself. If $t_b$ is the per-byte transfer cost and s is the size of a trapezoid in bytes, then

$$t_t = (p - 1)\,\nu s\,t_b \qquad (3)$$


Taking into account buffering, the number of messages m generated by each processor is

$$m = (p - 1)\,\frac{\nu}{d} \qquad (4)$$

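and the total time for sending trapezoids is simply the per-message latency plus the transfer cost:

$$t_{send} = m\,t_l + t_t = (p - 1)\,\frac{\nu}{d}\,t_l + (p - 1)\,\nu s\,t_b \qquad (5)$$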
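
A short Python sketch of this communication model may help in seeing how the buffer depth enters. It assumes the sending time is the number of messages times the latency plus the transfer cost, as the definitions above imply. The function name and all numeric values are hypothetical and carry no particular units.

```python
def send_time(p, nu, d, s, t_l, t_b):
    """Model of the time one processor spends sending trapezoids.
    p: processors, nu: trapezoids per neighbor, d: buffer depth,
    s: trapezoid size in bytes, t_l: latency per message, t_b: cost per byte."""
    messages = (p - 1) * nu / d          # no cost for messages to itself
    transfer = (p - 1) * nu * s * t_b    # every byte is sent exactly once
    return messages * t_l + transfer


if __name__ == "__main__":
    # Deeper buffers reduce only the latency part; the transfer part is unchanged.
    for depth in (1, 5, 10):
        print(depth, send_time(p=3, nu=10, d=depth, s=64, t_l=100.0, t_b=0.5))
```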
4.4  Rasterizing trapezoids

Since each pixel of each trapezoid is rasterized exactly once, and this work is split equally among
the processors, it would appear that this part of the algorithm is linear. However, by splitting
the triangles into trapezoids we incur an overhead for each trapezoid prior to rasterization. The
rasterization step essentially consists of several applications of the Bresenham linear interpolation
algorithm [5], once in the vertical direction and once per scanline in the horizontal direction.
The overhead is incurred in the vertical application of the Bresenham algorithm, which must be
performed for every trapezoid. Therefore the nonlinear contribution to the running time is $\tau\,t_B$,
where $t_B$ is the startup cost for the Bresenham algorithm. In terms of integer arithmetic operations,
$t_B$ is 5 divides, 10 multiplies, and 20 adds.

4.5  Storing  pixels 

The z-buffer compare and conditional store operations are perfectly distributed among the
processors, so this computation contributes only a linear term to the running time.
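
For readers unfamiliar with the operation, the z-buffer compare and conditional store for one horizontal span can be sketched as below. This is a generic illustration in Python, not the code used in our implementation. The name store_span is hypothetical, and the convention that a smaller z value is nearer to the viewer is an assumption of the sketch.

```python
def store_span(color_buf, z_buf, row, x_start, pixels):
    """Write one horizontal span into a local frame buffer segment.
    pixels is a list of (z, color) pairs for consecutive x positions.
    A pixel is stored only if it is nearer than the value already present.
    Returns the number of pixels actually written."""
    written = 0
    for offset, (z, color) in enumerate(pixels):
        x = x_start + offset
        if z < z_buf[row][x]:            # compare
            z_buf[row][x] = z            # conditional store of depth
            color_buf[row][x] = color    # and of color
            written += 1
    return written


if __name__ == "__main__":
    colors = [[None] * 4]
    depths = [[float("inf")] * 4]
    store_span(colors, depths, 0, 0, [(5, "a"), (5, "a"), (5, "a")])
    store_span(colors, depths, 0, 1, [(7, "b"), (3, "b"), (3, "b")])
    print(colors[0])
```

Each processor touches only its own stripe of the two buffers, which is why this step needs no communication.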

4.6  Wait  time

The rendering algorithm as viewed by a single processor consists of two distinct phases. During
the first phase, the processor alternates between processing triangles and rasterizing trapezoids.
If no trapezoids have arrived during a loop iteration, the processor can keep busy by processing
more triangles during the subsequent iteration. In the second phase, all local triangles have been
processed and the processor polls for incoming trapezoids until a Global Completion (GC) message
arrives. During this second phase, the processor will be idle if trapezoids fail to arrive at least as
fast as they can be rasterized. Furthermore, no processor can terminate until the slowest one has
finished.

A precise treatment of this situation requires an excursion into queueing theory, which is beyond
the scope of this paper. Instead, we present an argument based on the capacity of the
communication network which approximates the observed performance of the algorithm. Our argument
assumes that performance is communication bound, rather than compute bound. For purposes
of exposition we will use a hypercube architecture. A similar analysis could be applied to other
communication networks.

When a processor completes its last triangle, it must flush its partially filled trapezoid buffers.
Since one of the goals of our algorithm is to conserve memory, we will assume that $d \le \nu/2$, in which
case outgoing buffers will contain on average d/2 trapezoids remaining to be sent. If we assume
a uniform scene, then processors will reach this state at more or less the same time. Therefore
the entire system contains p(p - 1) messages of average length d/2 which will be injected into the
network at about the same time. Because of edge contention, these messages cannot all be sent
simultaneously. For a hypercube of size $p = 2^k$ processors, the average distance a message must
travel, and therefore the number of edges it ties up, is kp/2(p - 1). Since the total number of edges
in a hypercube (assuming unidirectional communication) is kp/2, we have a bandwidth deficit of
the order

$$\frac{p(p - 1)\cdot\frac{d}{2}\cdot\frac{kp}{2(p - 1)}}{\frac{kp}{2}} = \frac{pd}{2} \qquad (6)$$

Thus wait time is roughly proportional to the shortage of communication capacity:

$$t_{wait} = \alpha\,\frac{pd}{2} \qquad (7)$$


In a subsequent section, we will determine $\alpha$ empirically for a particular implementation.
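
The hypercube quantities used in this argument are easy to check by brute force. The Python sketch below counts hops as the Hamming distance between node labels and counts each edge once. The function names are hypothetical, and the sketch assumes k is at least 1.

```python
def hypercube_stats(k):
    """Return (p, average distance to another node, number of edges)
    for a hypercube of dimension k >= 1."""
    p = 2 ** k
    # By symmetry it is enough to measure distances from node 0:
    # the hop count to a node is the number of 1 bits in its label.
    total_hops = sum(bin(node).count("1") for node in range(1, p))
    avg_distance = total_hops / (p - 1)
    # Each edge joins two labels differing in one bit; count it from the lower label.
    edges = sum(1 for a in range(p) for bit in range(k) if a < (a ^ (1 << bit)))
    return p, avg_distance, edges


def bandwidth_deficit(k, d):
    """Edge demand of p(p-1) messages of length d/2, divided by the edge count."""
    p, avg_distance, edges = hypercube_stats(k)
    demand = p * (p - 1) * (d / 2) * avg_distance
    return demand / edges


if __name__ == "__main__":
    for k in (1, 3, 7):
        p, dist, edges = hypercube_stats(k)
        print(p, dist, k * p / (2 * (p - 1)), edges, bandwidth_deficit(k, d=4))
```

For each k the measured average distance agrees with kp/2(p - 1), the edge count with kp/2, and the ratio with pd/2.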


4.7  Termination  algorithm 

The termination algorithm requires each pair of neighbors to exchange messages, followed by a
global merge step and a broadcast. The time required for these last two operations depends on the
architecture of the interconnection network. If we assume a hypercube, then

$$t_{term} = 2\left[(p - 1) + \log_2 p\right] t_l \qquad (8)$$

There is no per-byte cost, since these messages are used only as signals and contain no data.³


4.8  Total  time 


Combining  all  of  the  above  contributions,  we  find  the  total  running  time  t  for  the  algorithm: 

$$t = C\,\frac{n}{p} + \tau\,t_{split} + t_{send} + \tau\,t_B + t_{wait} + t_{term} \qquad (9)$$

Substituting  and  rearranging,  we  get 


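$$t = C\,\frac{n}{p} + \tau\left(t_{split} + t_B\right) + (p - 1)\,\nu s\,t_b + \left[2\left((p - 1) + \log_2 p\right) + (p - 1)\,\frac{\nu}{d}\right] t_l + \alpha\,\frac{pd}{2} \qquad (10)$$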
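
The complete model can be evaluated numerically with a sketch such as the following. It is only an illustration of how the terms combine: the function name is hypothetical, the constants passed in are placeholders rather than measured values, and the wait term is taken to have the form $\alpha pd/2$, which is an assumption based on the bandwidth argument of Section 4.6.

```python
import math


def total_time(n, p, y, h, d, s, C, t_split, t_B, t_l, t_b, alpha):
    """Predicted running time as a sum of the contributions derived above."""
    tau = n * h / (2 * y) + n / p        # trapezoids per processor
    nu = tau / p                         # trapezoids per destination
    linear = C * n / p                   # perfectly parallel part
    split_and_setup = tau * (t_split + t_B)
    transfer = (p - 1) * nu * s * t_b
    latency = (2 * ((p - 1) + math.log2(p)) + (p - 1) * nu / d) * t_l
    wait = alpha * p * d / 2             # assumed form of the wait term
    return linear + split_and_setup + transfer + latency + wait


if __name__ == "__main__":
    # Placeholder constants in arbitrary units, chosen only to show the trend.
    for p in (1, 2, 8, 32, 128):
        t = total_time(n=100000, p=p, y=1024, h=10, d=8, s=64,
                       C=50.0, t_split=2.0, t_B=4.0, t_l=100.0, t_b=0.01, alpha=20.0)
        print(p, round(t, 1))
```

Such a function makes it easy to see which term dominates for a given choice of p and d before any constants are fitted to measurements.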


In section 6 we present the experimental results for a particular implementation of this algorithm,
and compare those results with predictions from the analytical model. First, we describe some
pertinent details of our implementation.


5  iPSC/860  Implementation 

We have implemented the above rendering algorithm in the C language on the Intel iPSC/2 and
iPSC/860 hypercube computers. All of the experiments described below were performed on the
latter system. The actual implementation differs slightly from the algorithm described above, in
that shading calculations are pulled out of the main loop and done as a preprocessing step. (This
is advantageous if a shaded scene will be displayed repeatedly using different viewing parameters.)
Consequently, the rendering rates quoted below do not include the time for shading. Remember
that the shading step parallelizes perfectly, and so would only improve the observed processor
utilization. We should also note that the iPSC systems do not currently provide a graphical display
device, so rendering rates do not reflect the final display step of the pipeline in Figure 2.⁴

Our sample implementation incorporates a standard scanline-based, z-buffered triangle renderer.
The shading calculations take into account diffuse and ambient lighting components at the triangle
vertices, and the rasterization process smoothly interpolates these values across the triangle. We
use 8 bits for each of the red, green, and blue color channels, 24 bits for the z-buffer, and pixel
positions are maintained to a subpixel accuracy of one part in 64. The current implementation
makes little attempt to optimize the graphics code for the i860 processor chip employed in the
iPSC/860. The code is written entirely in a scalar (as opposed to vector) style, and no use has
³ Actually, the messages must at least convey a type, but we assume that all messages contain type information,
which we include as part of the latency, $t_l$.

⁴ Our current practice is to merge the finished contents of the local frame buffer segments into a file for later
viewing offline. Our emphasis here is on the behavior of the parallel rendering algorithm, rather than on the use of
the iPSC as a rendering engine.
Figure 5: Buffer busy ratio vs. buffer depth.

been  made  of  the  built-in  graphics  features  of  the  i860.  In  addition,  (he  compilers  available  to 
us  exploited  few  of  the  high  performance  capabilities  of  the  i860,  such  as  pipelining  and  dual 
instruction  mode.  With  better  compilers  and  some  tuning,  the  performance  of  the  graphics  code 
should  increase  substantially,  leaving  communication  and  I/O  as  the  limiting  factors  on  rendering 
speed. 

In  contrast  to  the  graphics  computations,  we  have  gone  to  some  lengths  to  optimize  message 
passing.  The  iPSC  operating  system  provides  asynchronous  routines  for  both  message  sending 
and receiving, which can be used to overlap message transfers with other computations. We have
taken  advantage  of  these,  in  conjunction  with  a  double-buffering  scheme,  to  hide  most  of  the 
overhead  associated  with  message  transfer  time  (ij)  as  well  as  much  of  the  edge  contention  delays. 
One  measure  of  overlap  is  the  number  of  times  processors  must  wait  when  inserting  trapezoids  into 
outgoing  buffers  because  the  buffers  are  still  busy  from  a  previous  send.  Figure  5  shows  this  number 
expressed  as  the  ratio  of  total  buffer  busy-waits  to  the  total  number  of  trapezoids  generated  across 
all of the processors. The values plotted are for buffer sizes ranging from 2 to v/2 with varying
numbers of processors, using our standard test scene, described in the next section.* Each data
point is the mean across five runs. It can be seen that the overlap strategy is very successful. In
all  cases,  for  d>  3,  more  than  99.5%  of  the  trapezoids  generated  were  able  to  be  placed  in  buffers 
immediately. 
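The busy-wait ratio plotted in Figure 5 can be illustrated with a small simulation. The following Python sketch is not the iPSC/860 code. It assumes a single destination, two outgoing buffers of depth d, a fixed time to generate each trapezoid, and a fixed completion time for each asynchronous send. The name `buffer_busy_ratio` and all of its parameters are hypothetical, chosen only for illustration.

```python
def buffer_busy_ratio(num_trapezoids, depth, gen_time, send_time):
    """Ratio of buffer busy-waits to trapezoids for one sender (illustrative model).

    Assumptions (not from the paper): one destination, two buffers,
    constant gen_time per trapezoid, constant send_time per buffer.
    """
    clock = 0.0
    busy_until = [0.0, 0.0]  # time at which each buffer's pending send completes
    current = 0              # index of the buffer currently being filled
    fill = 0                 # number of trapezoids in the current buffer
    waits = 0
    for _ in range(num_trapezoids):
        clock += gen_time  # time spent producing this trapezoid
        if fill == 0 and busy_until[current] > clock:
            # The buffer we want to start filling is still being sent.
            waits += 1
            clock = busy_until[current]
        fill += 1
        if fill == depth:
            # Buffer is full: start an asynchronous send and switch buffers.
            busy_until[current] = clock + send_time
            current = 1 - current
            fill = 0
    return waits / num_trapezoids
```

In this toy model a deeper buffer gives each send more time to complete before its buffer is needed again, which is the overlap effect described above. Real send times grow with message length and depend on network contention, so the sketch should not be read as a prediction of the measured ratios.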

*On the iPSC/860, different message passing protocols are employed for short messages (< 100 bytes) vs. long
messages (> 100 bytes). Since our trapezoid data structure is 64 bytes long, buffers of depth 1 have different
performance parameters than larger ones. For simplicity, we limit our analysis to buffer sizes ≥ 2.




6  Performance  Results 


In  this  section  we  present  experimental  results  from  the  iPSC/860  implementation  of  our  algorithm 
and  compare  them  to  predictions  based  on  our  performance  model.  Our  standard  test  scene  is 
composed  of  100000  10  x  10  pixel  triangles  in  random  orientations  (Fig.  6).  This  scene  was  chosen 
since  it  statistically  approximates  a  uniform  scene  for  purposes  of  comparison  with  the  performance 
model.  In  all  cases,  we  enable  back  face  culling  so  that  the  number  of  triangles  actually  drawn 
is about half of the total. The scene is rendered with a frame buffer resolution of 512 x 512.
The average triangle height on the projection plane as measured by the renderer is h = 8.3. To
determine the effects of scene complexity, we modify the standard scene by varying the number of
triangles while holding the triangle size, in pixels, constant. Unless otherwise noted, performance
figures are mean values across five runs.

6.1  Sensitivity  to  buffer  depth 

As mentioned previously, the selection of buffer depth can have a significant impact on performance.
Figure 7 shows scatterplots of rendering time vs. buffer depth for our standard scene, with p ranging
from 8 to 128. (Because of memory requirements, a minimum of 8 processors are needed to render
the standard scene.) Again, d ranges from 2 to v/2. The sensitivity to d can be readily understood
in terms of the performance model. If d is small, then the ratio v/d in Equation 5 is large and the
costs due to message latency are high. If d is large, latency is reduced but wait time due to network
congestion increases (Eq. 7). For sufficiently large d, our algorithm is equivalent to a simpler
two-phase version in which (1) all triangles are first split into trapezoids and the trapezoids are stored
in memory, then (2) trapezoids are sent to their destinations and rasterized. It is clear from the
performance model that this simpler algorithm not only wastes memory, but also maximizes edge
contention by injecting all of the trapezoid data into the network at once. By using smaller buffer
sizes and allowing splitting and rasterization to proceed together, our algorithm not only conserves
memory, but spreads the communication load over a longer period of time.

For best performance, we would like to be able to predict an optimum buffer size, d_opt, without
having to resort to a long series of test runs. If we know something about the scene, such as τ, or
n and h, then the performance model can be used to determine a near-optimal value for d. (Recall
that v = τ/p.) If we take the derivative of Equation 10 with respect to d (ignoring the ceiling
function) and solve for the minimum, we get


The case where τ is unknown is discussed in Section 6.4 in the context of non-uniform scenes. We
now turn our attention to values for t_l and a.

6.2  Message  latency  and  wait  time 

Experimental measurements of message latency on the iPSC/860 have typically been done under
carefully controlled test conditions in order to get consistent results. Because our algorithm is very
dynamic, and because we include contributions due to buffer management, published values for
message latency are not directly applicable. In addition, our simplistic analysis of wait time does
not yield a value for the proportionality constant a. These considerations lead us to determine the
values of t_l and a empirically. We recast Equation 10 as a function of d:

\[ t(d) \approx C_0 + \frac{C_1}{d} + C_2 d \qquad (12) \]
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Figure  7:  Execution  time  as  a  function  of  buffer  depth  for  the  standard  test  scene,  d  =  2, . . . ,  v/2. 
where 

Co  =  C- +  + /b)  + 2  [(p- 1)  + log2  ;>]//„  (13) 

P 

Cl  =  (p-l)uti  (M) 

C,  =  ^  (15) 

For convenience, we again ignore the ceiling function. Because of the high degree of overlap achieved
between communication and computation in our implementation, we have also dropped the term
for data transfer time from Equation 3. Finally, we substitute t_ls for t_l in the contribution from
the termination algorithm to reflect the differing protocols for short and long messages. We can
now do a least-squares fit using the data from our standard test scene to determine, approximately,
the values of the coefficients C_0, C_1, and C_2, and then solve for t_l and a. The results are shown in
Table 1.
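The fitting step can be sketched in a few lines of Python. The sketch assumes that the recast model has the form t(d) = C_0 + C_1/d + C_2 d, with one latency-like term that falls with d and one wait-like term that grows with d. The function names `fit_buffer_model` and `optimal_depth` are hypothetical, and the code is a generic linear least-squares fit rather than the procedure actually used to produce Table 1. Recovering t_l and a from the fitted coefficients requires the coefficient definitions in Equations 13 through 15 and is not attempted here.

```python
import math

def fit_buffer_model(depths, times):
    """Least-squares fit of t(d) = c0 + c1/d + c2*d; returns (c0, c1, c2)."""
    rows = [(1.0, 1.0 / d, float(d)) for d in depths]
    # Normal equations (A^T A) c = A^T t, stored as an augmented 3x4 matrix.
    m = [[sum(r[i] * r[j] for r in rows) for j in range(3)]
         + [sum(r[i] * t for r, t in zip(rows, times))]
         for i in range(3)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(3):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [a - factor * b for a, b in zip(m[r], m[col])]
    return tuple(m[i][3] / m[i][i] for i in range(3))

def optimal_depth(c1, c2):
    """Minimizer of c0 + c1/d + c2*d for c1, c2 > 0, ignoring the ceiling function."""
    return math.sqrt(c1 / c2)
```

Setting the derivative of the assumed form to zero gives -C_1/d^2 + C_2 = 0, so the minimum lies at d = sqrt(C_1/C_2). This is a property of the assumed three-term form only, and the ceiling function in the full model can shift the true optimum.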

The data suggest that t_l and a are not constants, but are instead functions of p, or more specifically,
k, where k = log2 p. However, the limited sample size does not allow any firm conclusions to be

drawn,  and  in  the  absence  of  a  theoretical  basis  for  determining  the  form  of  the  functions,  we  have 

chosen  to  use  the  mean  values. 

6.3  Measured  vs.  predicted  performance 

To further explore the parallel performance of our algorithm and to validate the analytical model,
we varied the complexity of the random triangle scene from 6250 to 200000 triangles in multiples
of 2. For each scene, p ranged from the minimum allowed by memory requirements up to 128 in
powers of 2, using the optimal buffer depth predicted by Equation 11 (Table 2). Figure 8a shows the
             Time in μs
   p        t_l        a
   8        452       113
  16        443       120
  32        419       152
  64        411       185
 Mean       431       143

Table  1:  Empirical  values  of  message  latency  and  wait  time  for  the  standard  test  scene. 


             Predicted Optimal Buffer Depth
   p     6250   12500   25000   50000  100000  200000
   2       69      98     138       -       -       -
   4       43              85     121       -       -
   8       23      33      47      66      94       -
  16       12      18      25      35      50      71
  32        7       9      13      19      27      38
  64        4                      10      15      21
 128        2               4       6       9      12

Table 2: Predicted buffer sizes for several random triangle scenes.


observed rendering rates for each scene. The results show that performance continues to increase
as  processors  are  added,  even  for  the  smallest  scene,  although  large  numbers  of  processors  are  most 
effective  for  more  complex  scenes. 

The  usual  measure  of  effectiveness  of  a  parallel  algorithm  is  speedup,  defined  as  the  time  to 
execute  a  problem  on  a  single  processor  divided  by  the  time  to  execute  it  on  p  processors.  In  our 
case,  only  the  smallest  test  scenes  can  be  run  on  a  single  processor  due  to  memory  limitations, 
so  traditional  speedups  cannot  be  computed  directly.  Instead,  we  normalize  performance  across 
scenes  by  comparing  the  rendering  rates,  instead  of  the  execution  time,  and  use  these  to  estimate 
speedups.®  We  define  the  performance  level  for  p  =  1  to  be  the  rendering  rate  of  the  largest 
test  scene  which  would  fit  on  a  single  processor,  which  was  4366  triangles/second  for  n  =  12500. 
Table  3  shows  speedups  relative  to  this  case.  Speedups  on  large  numbers  of  processors  (64  and 
128) are poor primarily due to communication costs (t_send and t_wait), which are the dominant
overheads. As p decreases, the trapezoid costs (t_split and t_s) become the primary overheads, and
speedups are reasonable on moderate numbers of processors (16 and 32). Figure 9 shows the relative
contributions of the individual terms in the performance model for our standard test scene. On the
plot, t_split and t_s have been combined into a single term, t_trap. Note that the contributions for t_send

‘We  consider  our  normalized  speedup  computations  to  be  just  estimates  for  two  reasons:  (1)  As  the  density 
of the random triangle scenes increases, a larger proportion of the z-buffer comparisons will fail because pixels
are  obscured  by  other  triangles  which  lie  closer  to  the  viewer.  This  results  in  a  lower  percentage  of  frame  buffer 
stores  and  slightly  reduced  computational  cost  per  pixel.  (2)  Because  of  the  suspected  effects  of  caching  (described 
subsequently),  execution  times  on  small  numbers  of  processors  may  not  be  directly  comparable  to  those  on  larger 
numbers  of  processors.  This  effect  could  be  mitigated  by  comparing  performance  at  constant  values  of  n/p,  but  the 
compensation  is  only  partial  since  the  size  of  the  frame  buffer  segments  is  independent  of  the  number  of  triangles. 
Figure 8: (a) Observed rendering rates for several random triangle scenes. (b) Observed and
predicted rendering times for n = 12500, 50000, and 200000. Solid lines are the predicted performance.


and t_wait are roughly equal because the buffer size was chosen on the basis of Equation 11. The
divergence of t_send and t_wait at large values of p illustrates the importance of the ceiling function
in Equation 10, a contribution which was ignored in order to derive Equation 11.
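The normalized speedup estimate described above amounts to a ratio of rendering rates. The Python sketch below illustrates the arithmetic. The baseline of 4366 triangles/second is the single-processor rate quoted in the text for n = 12500, while the function names and the assumption that a rendering rate is simply the scene's triangle count divided by the measured rendering time are ours.

```python
# Single-processor baseline quoted in the text (n = 12500 on one processor).
BASELINE_RATE = 4366.0  # triangles/second

def rendering_rate(num_triangles, seconds):
    """Rendering rate in triangles/second (assumed definition: n / time)."""
    return num_triangles / seconds

def estimated_speedup(num_triangles, seconds, baseline_rate=BASELINE_RATE):
    """Speedup estimate: observed rendering rate relative to the baseline rate."""
    return rendering_rate(num_triangles, seconds) / baseline_rate
```

For instance, a hypothetical run that renders 43660 triangles in one second would be credited with a speedup estimate of 10 under this definition, regardless of which scene size produced it.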

We note in passing an interesting phenomenon in the speedup data. At certain points in the
table (shown in bold type), dramatic increases in performance are observed from one value of p
to the next. Since these points occur at fixed values of n/p, we conjecture that they are due to
caching on the i860 processor. As p increases, the size of several data structures (triangles, frame
buffer segment, message buffers) decreases, which may result in better cache hit ratios.

In Figure 8b, we compare the observed and predicted performance of several test scenes. To
predict performance using our model, we must first determine the value of the scene-dependent
constant C. This is done by taking the observed rendering time on some number of processors p and
solving Equation 10 for C. We have chosen the entries lying along the boldface diagonal in Table 3
as the points at which to solve for C (points of constant n/p). We also need values for t_split, t_s, and
t_ls. Based on the operation counts from Section A and timing information from [8, 9], we estimate
that t_split = 2.500 μs and t_s = 13.375 μs. Since communication in the termination algorithm uses
synchronous (non-overlapped) message passing routines and incurs very little overhead beyond the
actual message transmission, we use published latency data [1] to set t_ls = 75 μs. As the plot
shows, our model successfully predicts the general performance trends. Some discrepancies occur
for small p where the suspected caching perturbations occur, and the model underestimates slightly
the overheads at large p. This lends credence to our previous observation that a is an increasing
                         Speedup
   p     6250   12500   25000   50000  100000  200000
   1      0.6     1.0       -       -       -       -
   2      1.7     1.2     2.0       -       -       -
   4      3.2     3.4     2.5     3.8       -       -
   8      5.8     6.2     6.4     5.1     7.5       -
  16      9.9    10.7    11.6    12.3    10.5    14.2
  32     13.1    16.1    18.6    20.1    20.7    25.5
  64     15.3    18.6    23.4    28.0    32.8    40.1
 128     18.8    21.2    25.6    31.4    37.4    50.6

Table 3: Speedup estimates derived from observed rendering rates. Boldface entries indicate
unexpectedly large performance increases.


Figure  9:  Predicted  contributions  of  individual  components  of  the  performance  model  for  our 
standard  test  scene,  ttrap  =  +  /g). 
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function  of  k,  although  other  terms  may  be  involved  as  well. 


6.4 Performance on non-uniform scenes

Although  our  random  test  scene  is  useful  for  analyzing  our  algorithm,  it  is  not  a  very  representative 
application. To obtain a better feel for performance on more realistic scenes, we ran experiments
with two additional test cases, shown in Figures 10-11. The first scene, which we will refer to as
Plato, contains a large number of small triangles with the density of triangles varying from place
to  place  in  the  scene.  The  second  scene,  designated  LDEF,  contains  a  wide  range  of  triangle  sizes, 
which  is  very  effective  at  desynchronizing  the  processors  because  of  differences  in  rasterization  time. 
Both  scenes  were  rendered  at  a  resolution  of  512  x  512. 

The first issue we address is that of picking a buffer size. Figure 12 shows rendering time as a
function of buffer size, where d varies from 2 to 1.25v. In contrast to the uniform scene shown in
Figure 7, the optimal buffer sizes for both of these scenes occur at much larger values of d. Thus
Equation 11 is not applicable because the processors are much farther out of sync and the final
buffer flushes are spread out in time, reducing the effect of t_wait. In the absence of an analytical
prediction for a good buffer size, we note that v/2 works well in many cases. Other experiments
have shown that for small values of n/p, increasing the buffer depth to around v offers additional
performance gains, a trend hinted at in Figure 12.

If τ, and hence v, are unknown, then the best we can do is hazard a guess. A buffer depth
of 10-100 seems like a good starting point since it reduces latency costs by one to two orders of
magnitude. As a rule, buffer depth should decrease with increasing p. If a scene will be rendered
repeatedly with minor changes in the viewing parameters from frame to frame, as in an animated
sequence, then the renderer can automatically adjust the buffer size. For the first frame an initial
guess is needed. For subsequent frames, the observed value of τ from the previous frame is used to
derive a better guess for d.
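The frame-to-frame adjustment just described can be sketched as follows. This is an illustration under stated assumptions, not the renderer's actual logic: it takes τ to be the total number of trapezoids observed in the previous frame, uses v = τ/p, and applies the v/2 rule of thumb noted earlier in this section. The function name `next_buffer_depth`, the default first-frame guess of 50 (an arbitrary value inside the 10-100 range), and the floor of 2 are all hypothetical choices.

```python
def next_buffer_depth(observed_trapezoids, num_processors,
                      initial_guess=50, minimum=2):
    """Choose a buffer depth for the next frame (illustrative heuristic).

    observed_trapezoids: trapezoid count from the previous frame, or None
    for the first frame, when only a guess is available.
    """
    if observed_trapezoids is None:
        return initial_guess
    per_processor = observed_trapezoids / num_processors  # v = tau / p
    return max(minimum, int(per_processor // 2))           # d = v / 2, floored
```

A renderer would call this once per frame, passing the count gathered while rendering the previous frame. Because v shrinks as p grows, the returned depth decreases with increasing p, in line with the rule stated above.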

Figure 13 shows rendering rates for the LDEF and Plato scenes using a buffer depth of v/2.
Because of memory limitations, neither of these scenes could be rendered with a single processor at
512 x 512 resolution. Hence we have no single-processor data on which to base speedup estimates.
Examination of the available data shows that processor utilization is best for p of 16 to 32 or less,
consistent with the previous results for scenes of these sizes. Note that performance of the Plato
scene peaks out at 64 processors and then declines as the communication overhead becomes dominant.
Careful choice of buffer sizes can boost the Plato performance on 64 and 128 nodes to about
125000 triangles/second.

7  Considerations  for  Shared  Memory  Architectures 

Although  the  algorithm  described  in  Section  3  was  designed  specifically  for  distributed  memory 
machines, it can be readily adapted for shared memory architectures. We assume that viable shared
memory systems would support an efficient mechanism for implementing critical sections on shared
variables. Given this, the basic structure of the algorithm remains the same, with interprocessor
communication taking place through shared data structures rather than with messages.

Instead of partitioning the triangles in round-robin fashion and assigning them to particular
processors, they are placed in a shared list or array. When a processor needs a triangle to work
on, it grabs the next one from the list, in typical self-scheduled fashion. The overhead for fetching
the triangle is the time it takes to lock the list index variable, read the current value, increment it,
and unlock it. Presumably this can be done in a few instruction times given suitable architectural
support. Some wait time may be incurred if another processor already has the variable locked. This
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Figure  12:  Execution  time  as  a  function  of  buffer  depth  for  the  (a)  Plato  and  (b)  LDEF  scenes,  d 
ranges from 2 to 1.25v. The vertical bars indicate d = v/2.
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Figure  13:  Rendering  rates  for  the  LDEF  and  Plato  scenes. 

should  not  be  a  problem  for  moderate  numbers  of  processors,  since  the  time  to  process  a  triangle 
would  be  much  larger  than  the  time  required  to  fetch  and  update  the  list  index. 
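The self-scheduled triangle fetch can be illustrated with Python threads standing in for processors. The sketch is only an analogy for the lock, read, increment, unlock sequence described above. The class name `TriangleList` and the helper `run_workers` are hypothetical, and a mutex from the `threading` module takes the place of whatever critical-section mechanism a real shared memory machine would provide.

```python
import threading

class TriangleList:
    """Shared triangle array with a lock-protected index (illustrative)."""

    def __init__(self, triangles):
        self._triangles = triangles
        self._next = 0
        self._lock = threading.Lock()

    def take(self):
        """Return the next unclaimed triangle, or None when none remain."""
        with self._lock:                  # lock the list index variable
            if self._next >= len(self._triangles):
                return None
            index = self._next            # read the current value
            self._next += 1               # increment it
        return self._triangles[index]     # lock is released before the work begins

def run_workers(triangles, num_workers):
    """Let num_workers threads claim triangles; returns what each one took."""
    shared = TriangleList(triangles)
    taken = [[] for _ in range(num_workers)]

    def worker(slot):
        while True:
            tri = shared.take()
            if tri is None:
                break
            taken[slot].append(tri)       # stand-in for shading, splitting, etc.

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return taken
```

Each triangle is claimed by exactly one worker, and a worker that happens to process expensive triangles simply claims fewer of them, which is the load balancing benefit of self-scheduling.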

The  other  major  shared  data  structure  is  the  frame  buffer.  A  naive  approach  would  be  to 
let  processors  rasterize  triangles  directly  into  the  frame  buffer  after  transforming  them  into  screen 
coordinates.  But  since  many  processors  would  be  doing  this  simultaneously,  there  would  be  memory 
conflicts  when  triangles  overlapped  other  triangles  in  the  frame  buffer.  A  poor  solution  would  be 
to  lock  the  entire  frame  buffer  for  the  duration  of  the  rasterization  step,  but  that  would  effectively 
serialize  the  rasterization  phase  of  the  computation.  A  better  solution  is  to  partition  the  frame 
buffer  into  p  segments.  Then  triangles  could  be  split  into  trapezoids  as  in  our  original  algorithm. 
But  instead  of  sending  the  trapezoids  to  other  processors,  they  would  be  placed  on  a  shared  list  of 
trapezoids  needing  to  be  rasterized.  There  would  be  one  trapezoid  list  per  frame  buffer  segment. 
After  processing  one  or  more  triangles,  a  processor  would  grab  an  unlocked  frame  buffer  segment 
and  process  all  of  the  outstanding  trapezoids  queued  for  that  segment.  Because  there  are  as 
many  segments  as  there  are  processors,  at  least  one  will  always  be  unlocked.  By  not  tying  frame 
buffer  segments  to  particular  processors,  load  balancing  will  be  automatic  and  performance  should 
be  better  than  the  distributed  memory  version  of  the  algorithm.  As  before,  the  overhead  for 
maintaining the trapezoid lists and locking and unlocking frame buffer segments should be small
compared  to  the  cost  of  the  rasterization  computations. 

Thus,  the  shared  memory  version  of  the  algorithm  becomes: 
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Until  done 


    If triangles remain
        Select the next triangle
        Shade the triangle
        Transform, back face cull, and clip
        Split into trapezoids
        Insert the trapezoids onto the trapezoid lists

    Find an unlocked frame buffer segment with outstanding
    trapezoids (if any)
        Rasterize all of the trapezoids in that list

Continue

Termination  of  the  algorithm  is  also  simpler  in  the  shared  memory  version.  Each  processor  must 
certify when it has finished working on its last triangle. This occurs when a processor checks for
the next triangle and none remain. After all processors have finished their last triangle, then when
all  of  the  trapezoid  lists  become  empty  and  all  of  the  frame  buffer  segments  are  unlocked,  rendering 
is  complete. 
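A runnable sketch of the shared memory loop, again using Python threads as stand-in processors, is given below. It is an illustration under several assumptions rather than a definitive implementation. The function `render_shared` and its `split` and `rasterize` callbacks are hypothetical names, there is one trapezoid list and one lock per frame buffer segment, and a non-blocking lock attempt models "find an unlocked segment". Termination is simplified: once every worker has inserted its last trapezoid, which is detected here with a barrier, each worker makes one final blocking pass over all segments so that no trapezoid is left behind.

```python
import threading

def render_shared(triangles, num_workers, split, rasterize):
    """split(tri) -> [(segment, trapezoid), ...]; rasterize(segment, trapezoid)."""
    index_lock = threading.Lock()                 # guards the shared triangle index
    next_index = [0]
    list_locks = [threading.Lock() for _ in range(num_workers)]
    segment_locks = [threading.Lock() for _ in range(num_workers)]
    pending = [[] for _ in range(num_workers)]    # one trapezoid list per segment
    all_inserted = threading.Barrier(num_workers)

    def take_triangle():
        with index_lock:
            if next_index[0] >= len(triangles):
                return None
            tri = triangles[next_index[0]]
            next_index[0] += 1
            return tri

    def drain(seg):
        # Caller must hold segment_locks[seg]. Returns the number rasterized.
        with list_locks[seg]:
            batch, pending[seg] = pending[seg], []
        for trap in batch:
            rasterize(seg, trap)
        return len(batch)

    def drain_unlocked_segment():
        # Rasterize the outstanding trapezoids of the first free, non-empty segment.
        for seg in range(num_workers):
            if segment_locks[seg].acquire(blocking=False):
                try:
                    if drain(seg):
                        return True
                finally:
                    segment_locks[seg].release()
        return False

    def worker():
        while True:
            tri = take_triangle()
            if tri is None:
                break
            for seg, trap in split(tri):
                with list_locks[seg]:
                    pending[seg].append(trap)
            drain_unlocked_segment()
        while drain_unlocked_segment():           # keep helping while work is visible
            pass
        all_inserted.wait()                       # no more trapezoids will be inserted
        for seg in range(num_workers):            # final blocking sweep
            with segment_locks[seg]:
                drain(seg)

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

Because a segment is rasterized only while its lock is held, the `rasterize` callback never runs concurrently for the same segment, so it can update that segment of the frame buffer without further synchronization. The barrier is a simplification: a worker that reaches it idles until the others finish splitting, whereas the scheme described in the text would let it continue rasterizing.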

Modification  of  the  performance  model  for  the  shared  memory  algorithm  is  straightforward. 
An  additional  nonlinear  term  is  needed  to  model  contention  for  the  triangle  list  index  variable. 
Message passing terms in the distributed memory model are replaced with terms which reflect the
time needed to update the trapezoid lists (including contention) and to search for unlocked frame
buffer  segments  with  outstanding  trapezoids. 

8  Conclusion 

In this paper we have described a parallel rendering algorithm for MIMD computer architectures.
The algorithm is attractive for its exploitation of both object and pixel level parallelism. We have
given a theoretical analysis of its performance on distributed memory, message passing systems,
and compared this with an actual implementation on the Intel iPSC/860 hypercube. Our results
show that the algorithm is a viable means of achieving a highly parallel renderer. Scalability is
limited primarily by communication costs, which increase as a function of the number of processors.
Expected improvements in communication speed and optimization of the transformation and
rasterization software will allow this algorithm to compete favorably with other high-performance
rendering systems.
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